This article is a summary of "Spark & Flink Online for Real-Time Big Data Processing" from Fast Campus. It covers the overall concepts; for the programming-related content, taking the lecture itself is recommended.
Introduction to Data Engineering and its Tools
Keon Kim, an expert in data engineering, recently gave a lecture introducing the field of data engineering and its key tools, such as Spark, Kafka, and Flink. He began by discussing his own experience at Uber, where he worked on various projects aimed at optimizing the customer experience and driver incentives.
Keon went on to explain that data engineering is the process of transforming and organizing data, and that stream processing is a major focus of the field. The lecture then covered the various tools required for big data processing and parallel processing.
Keon emphasized that the purpose of data engineering is to create an infrastructure for data-based decisions. He explained that there are two types of data-based decisions: decisions about the business itself, and decisions for operating and improving the service. Keon stressed that the need for data engineering is felt by all companies, regardless of sector, and therefore presents a great opportunity for startups.
He provided a number of examples of successful startups in the field, including Snowflake, Confluent, Databricks, Segment, and Dremio. These companies have all successfully solved big data problems and grown into large, successful enterprises.
Keon also highlighted that data-based decision-making is becoming increasingly important, with around 60% of Fortune-listed companies employing a Chief Data Officer. He explained that companies that hire CDOs tend to outperform those that do not, which is why data engineering and data analysis are among the fastest-growing job categories.
Finally, Keon discussed a few examples of data-driven decisions, including Netflix, which uses data to personalize recommendations, and British Petroleum (BP), which used data to optimize production and reduce methane emissions.
In summary, Keon's lecture provides an excellent introduction to data engineering and its key tools, and highlights the importance of data-based decision-making in today's business world.
Understanding Data Infrastructure Trends
With the advancements in technology and the increasing importance of data analysis in businesses, the world of data engineering is constantly evolving. In this article, we will dive into the infrastructure trends in the data industry.
In the past, computing power and storage were expensive, and data management meant defining a schema in advance to fix the format of the data. Once the schema was created, it was difficult to change, so the database modeling process was crucial. The ETL (Extract, Transform, Load) process was used to handle data in this environment. However, as data processing grew more complex, it became difficult to settle on a schema up front. At the same time, computing power became cheaper, making it feasible to store as much raw data as possible. It is now more profitable for companies to focus on business needs and speed than to squeeze every cost out of computing.
Today, the ELT (Extract, Load, Transform) method is replacing the traditional ETL method. ELT extracts data, stores it once as-is, and then applies a transform (T) step that varies with the data's consumer. Depending on the complexity of the system, extraction and loading may be performed in a single pass; after an EL step that organizes and stores the data to some extent through Spark or Flink, the data is transformed later for use by applications or analysis tools.
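To make the ELT flow concrete, here is a minimal PySpark sketch of the pattern described above. The paths, column names, and aggregation are illustrative assumptions, not details from the lecture.

```python
# Minimal ELT sketch with PySpark. Paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# E + L: extract from the source and load it as-is into cheap storage,
# without committing to a final schema up front.
raw = spark.read.json("s3://bucket/raw/events/2023-01-01/")
raw.write.mode("overwrite").parquet("s3://bucket/lake/events/2023-01-01/")

# T: later, transform the stored data for a specific consumer,
# e.g. a daily revenue table for an analytics dashboard.
events = spark.read.parquet("s3://bucket/lake/events/2023-01-01/")
daily_revenue = (
    events.filter(F.col("type") == "purchase")
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.write.mode("overwrite").parquet("s3://bucket/warehouse/daily_revenue/")
```

The point of the ordering is that the raw data is preserved first; if a new consumer needs a different transform later, it can be re-derived without re-extracting from the source.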
Like other infrastructure, data platforms are moving to the cloud. Cloud data warehouses are a strong trend, with solutions such as Snowflake and Google BigQuery widely used. There is also a move from Hadoop to next-generation engines such as Databricks and Presto, and demand for real-time big data processing is increasing. In addition, data platform management is becoming increasingly centralized, and tools for managing schemas, access, and data discovery are gradually maturing.
Data architecture can be divided into six major areas: sources, ingestion and transformation, storage of data in a form that systems such as machine learning can use, historical processing that queries past data, predictive processing that forecasts the future, and output. These areas provide data to both internal and external users, and the path of data can be traced from creation to application across them.
Data engineering is mainly focused on the ingestion, transformation, and processing of data. Popular tools in the ingestion and transformation areas include Airflow, Kafka, Spark, and Flink. In the predictive area, machine learning tools such as TensorFlow and PyTorch come in, while Spark, Spark ML, and Flink are used for data analysis.
In summary, as data processing becomes increasingly complex, there is a shift from ETL to ELT, a move to the cloud, and growing demand for real-time big data processing. Centralized data platform management and automation tools are evolving across the industry. The future of data infrastructure centers on ingestion, transformation, and data processing.
Batch and Stream Processing
In this section, we cover the basic characteristics of batch and stream processing. Here are the key takeaways in bullet-point form:
Batch Processing
- Processes a large amount of data at once at a predefined time.
- Works on a limited, known amount of data, processed on a fixed schedule.
- Used when heavy processing is required and real-time results are not needed.
- Examples include supply and demand forecasting, sending daily marketing emails, and monthly salary payments (a minimal sketch follows below).
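A minimal sketch of a daily batch job in plain Python, assuming a bounded CSV file of the day's orders (the file layout and fields are hypothetical):

```python
# Batch sketch: process one day's worth of records in a single scheduled run.
import csv
from datetime import date

def run_daily_batch(day: date) -> None:
    # The input is bounded: everything for `day` is already on disk.
    path = f"orders-{day.isoformat()}.csv"
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["amount"])
            count += 1
    # One result for the whole window, produced after the window has closed.
    print(f"{day}: {count} orders, {total:.2f} total revenue")

if __name__ == "__main__":
    # Typically triggered once per day by a scheduler such as cron or Airflow.
    run_daily_batch(date(2023, 1, 1))
```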
Stream Processing
- Continuously processes data that keeps arriving over time.
- Processes data whenever an event occurs.
- Used when real-time processing must be guaranteed and data arrives from multiple sources, sporadically or continuously.
- Examples include real-time fraud detection and predicting price drops (see the sketch after this list).
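A minimal stream-processing sketch using the kafka-python client; the topic name and message shape are assumptions for illustration:

```python
# Stream sketch: react to each event as it arrives instead of waiting
# for a window to close. Uses the kafka-python client.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "payments",                          # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# The input is unbounded: this loop runs as long as events keep arriving.
for message in consumer:
    event = message.value
    # Per-event logic, e.g. flagging suspiciously large payments in real time.
    if event.get("amount", 0) > 10_000:
        print(f"possible fraud: {event}")
```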
Micro-batch Processing
- A method of batch processing that collects data little by little as it arrives.
- Approximates streaming by breaking batch processing into small, frequent pieces (see the sketch after this list).
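Spark Structured Streaming is a well-known micro-batch implementation. A minimal sketch, assuming a Kafka source and a one-minute trigger (running it also requires the spark-sql-kafka connector package):

```python
# Micro-batch sketch: read from Kafka and process accumulated records
# in small batches on a fixed trigger interval.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-sketch").getOrCreate()

stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "payments")   # hypothetical topic
         .load()
)

# Each trigger fires a small batch over whatever arrived in the last
# minute, approximating a stream with many tiny batch jobs.
query = (
    stream.writeStream.format("console")
          .trigger(processingTime="1 minute")
          .start()
)
query.awaitTermination()
```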
Example of Mixing Batch and Stream Processing
- One example is Michelangelo, Uber's platform for training and deploying machine learning models.
- The online side had real-time data streaming and a log that stores incoming data, while the offline side organized historical data through a batch pipeline that processes it all at once.
- Trained models were deployed both for batch processing and for real-time online serving.
- Models deployed on both sides were used to predict delivery times and to forecast supply and demand (a toy sketch of the pattern follows below).
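The following toy sketch illustrates only the mixing pattern, not Michelangelo itself: a "model" is fitted offline on historical data in batch, then applied per event on the online side. All numbers and features are made up for illustration.

```python
# Mixed batch + stream sketch: train offline, score live events online.
import random

# --- Offline / batch side: fit a trivial "model" on historical data. ---
historical_delivery_minutes = [28.0, 35.0, 31.0, 40.0, 26.0]
baseline = sum(historical_delivery_minutes) / len(historical_delivery_minutes)

def predict_delivery_minutes(distance_km: float) -> float:
    # Stand-in for a trained model: baseline plus a distance adjustment.
    return baseline + 2.0 * distance_km

# --- Online / stream side: apply the batch-trained model per event. ---
def handle_order_event(event: dict) -> None:
    eta = predict_delivery_minutes(event["distance_km"])
    print(f"order {event['id']}: predicted delivery in {eta:.0f} minutes")

# Simulated incoming events; in production these would come from a stream
# such as Kafka.
for i in range(3):
    handle_order_event({"id": i, "distance_km": random.uniform(1, 8)})
```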
In summary, batch processing is used for heavy workloads on a predefined schedule, while stream processing is used for continuous processing in real time. Micro-batch processing approximates streaming by breaking batch processing into small pieces. Mixing batch and stream processing can lead to effective use of resources and effective predictions.
Dataflow Workflow Orchestration
Dataflow workflow orchestration involves scheduling, distributed execution, and managing the dependencies between tasks. As services grow, the complexity of data platforms increases, and so does the demand for managing tasks and their dependencies.
Without a workflow orchestration system, broken workflows can hinder real-time services, and engineers spend time figuring out the cause and re-running tests. With a data orchestration system, failed tasks can be retried according to a pre-written scenario, allowing the whole workflow to succeed.
One example of a data orchestration system is Apache Airflow, which provides a dashboard for managing data tasks and workflows. This helps simplify data tasks and manage workflows, ensuring smooth and efficient service.
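As an illustration, a minimal Airflow DAG might look like the following; the task names, schedule, and retry settings are assumptions, with the retries standing in for the pre-written recovery scenario described above:

```python
# Minimal Apache Airflow DAG sketch: three dependent tasks with automatic
# retries, so a transient failure is retried instead of breaking the
# whole workflow. Task names and schedule are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting raw data")

def transform():
    print("transforming data")

def load():
    print("loading results")

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Declare the dependencies; Airflow schedules, runs, and retries them.
    t1 >> t2 >> t3
```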
Key points:
- Dataflow workflow orchestration involves scheduling, distributed execution, and managing task dependencies.
- Without workflow orchestration, broken workflows can hinder real-time services and take time to fix.
- Data orchestration systems like Apache Airflow help manage data tasks and workflows efficiently.
- Airflow provides a dashboard to manage data tasks and workflows in real time.